Optimization by gradient boosting
Gradient boosting is a state-of-the-art prediction technique that
sequentially produces a model in the form of linear combinations of simple
predictors---typically decision trees---by solving an infinite-dimensional
convex optimization problem. We provide in the present paper a thorough
analysis of two widespread versions of gradient boosting, and introduce a
general framework for studying these algorithms from the point of view of
functional optimization. We prove their convergence as the number of iterations
tends to infinity and highlight the importance of having a strongly convex risk
functional to minimize. We also present a reasonable statistical context
ensuring consistency properties of the boosting predictors as the sample size
grows. In our approach, the optimization procedures are run forever (that is,
without resorting to an early stopping strategy), and statistical
regularization is basically achieved via an appropriate penalization of
the loss and strong convexity arguments.
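A minimal sketch of this functional view, here with squared-error loss, a small learning rate, and scikit-learn regression trees as the simple predictors (all of these choices are illustrative assumptions, not the exact algorithms analyzed in the paper):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, n_iter=200, lr=0.1, max_depth=3):
        """Squared-error gradient boosting: each step fits a small tree to the
        negative gradient (the residuals) and adds it to the linear combination."""
        f = np.full(len(y), y.mean())           # constant initial predictor
        trees = []
        for _ in range(n_iter):
            residuals = y - f                   # negative gradient of the squared loss
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
            f += lr * tree.predict(X)           # update the additive model
            trees.append(tree)
        def predict(X_new):
            return y.mean() + lr * sum(t.predict(X_new) for t in trees)
        return predict

Running the loop for more and more iterations mirrors the paper's regime of never stopping early; the penalization-based regularization it studies is not reproduced in this sketch.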
Analysis of a Random Forests Model
Random forests are a scheme proposed by Leo Breiman in the 2000's for
building a predictor ensemble with a set of decision trees that grow in
randomly selected subspaces of data. Despite growing interest and practical
use, there has been little exploration of the statistical properties of random
forests, and little is known about the mathematical forces driving the
algorithm. In this paper, we offer an in-depth analysis of a random forests
model suggested by Breiman (2004), which is very close to the original
algorithm. We show in particular that the procedure is consistent and adapts to
sparsity, in the sense that its rate of convergence depends only on the number
of strong features and not on how many noise variables are present.
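A stylized sketch of a forest in this spirit: each tree repeatedly picks a coordinate at random and splits the current cell at its midpoint down to a fixed depth, and the forest averages the tree predictions. The uniform choice of split coordinate, the fixed depth, and features rescaled to [0, 1]^d are simplifying assumptions, not the exact model studied in the paper:

    import numpy as np

    def centered_tree_predict(X, y, x, depth, rng):
        """One centered tree evaluated at x: follow x down `depth` random
        midpoint splits, then average the responses falling in x's leaf."""
        d = X.shape[1]
        lower, upper = np.zeros(d), np.ones(d)   # current cell, features assumed in [0, 1]^d
        in_cell = np.ones(len(y), dtype=bool)
        for _ in range(depth):
            j = rng.integers(d)                  # random split coordinate
            mid = (lower[j] + upper[j]) / 2      # midpoint split
            if x[j] <= mid:
                in_cell &= X[:, j] <= mid
                upper[j] = mid
            else:
                in_cell &= X[:, j] > mid
                lower[j] = mid
        return y[in_cell].mean() if in_cell.any() else y.mean()

    def centered_forest_predict(X, y, x, n_trees=100, depth=5, seed=0):
        rng = np.random.default_rng(seed)
        return np.mean([centered_tree_predict(X, y, x, depth, rng)
                        for _ in range(n_trees)])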
The Statistical Performance of Collaborative Inference
The statistical analysis of massive and complex data sets will require the
development of algorithms that depend on distributed computing and
collaborative inference. Inspired by this, we propose a collaborative framework
that aims to estimate the unknown mean of a random variable. In
the model we present, a certain number of calculation units, distributed across
a communication network represented by a graph, participate in the estimation
of this mean by sequentially receiving independent data from the underlying distribution while
exchanging messages via a stochastic matrix defined over the graph. We give
precise conditions on the matrix under which the statistical precision of
the individual units is comparable to that of a (gold standard) virtual
centralized estimate, even though each unit does not have access to all of the
data. We show in particular the fundamental role played by both the non-trivial
eigenvalues of this matrix and the Ramanujan class of expander graphs, which provide
remarkable performance for moderate algorithmic cost.
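A minimal gossip-style sketch of this setting: every unit keeps a running mean of its own observations and, at each round, mixes its estimate with those of its neighbors through a row-stochastic matrix W supported on the graph. The update rule and the names below are illustrative assumptions, not the estimator analyzed in the paper:

    import numpy as np

    def collaborative_mean(W, streams, n_rounds):
        """W: (n, n) row-stochastic matrix whose nonzero entries follow the graph.
        streams[i] yields one new observation per round for unit i.
        Each round: absorb the new sample locally, then average with neighbors."""
        n = W.shape[0]
        theta = np.zeros(n)                        # local estimates, one per unit
        for t in range(1, n_rounds + 1):
            x = np.array([next(streams[i]) for i in range(n)])
            theta += (x - theta) / t               # running-mean update on own data
            theta = W @ theta                      # message exchange over the graph
        return theta

The spectral gap of W (its non-trivial eigenvalues) governs how quickly the local estimates agree, which is why well-connected graphs such as Ramanujan expanders are attractive at moderate communication cost.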
Statistical analysis of k-nearest neighbor collaborative recommendation
Collaborative recommendation is an information-filtering technique that
attempts to present information items that are likely of interest to an
Internet user. Traditionally, collaborative systems deal with situations with
two types of variables, users and items. In its most common form, the problem
is framed as trying to estimate ratings for items that have not yet been
consumed by a user. Despite wide-ranging literature, little is known about the
statistical properties of recommendation systems. In fact, no clear
probabilistic model even exists which would allow us to precisely describe the
mathematical forces driving collaborative filtering. To provide an initial
contribution to this, we propose to set out a general sequential stochastic
model for collaborative recommendation. We offer an in-depth analysis of the
so-called cosine-type nearest neighbor collaborative method, which is one of
the most widely used algorithms in collaborative filtering, and analyze its
asymptotic performance as the number of users grows. We establish consistency
of the procedure under mild assumptions on the model. Rates of convergence and
examples are also provided.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the
Institute of Mathematical Statistics (http://www.imstat.org), DOI:
http://dx.doi.org/10.1214/09-AOS759.
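A small sketch of a cosine-type k-nearest neighbor predictor on a user-item rating matrix with missing entries. The matrix layout, the convention of treating missing ratings as zeros when computing similarities, and the value of k are illustrative assumptions, not the exact method analyzed in the paper:

    import numpy as np

    def predict_rating(R, user, item, k=10):
        """R: (n_users, n_items) array with np.nan for unobserved ratings.
        Predict R[user, item] by averaging the ratings given to `item` by
        the k users most cosine-similar to `user`."""
        filled = np.nan_to_num(R)                          # missing ratings as 0
        norms = np.linalg.norm(filled, axis=1)
        sims = filled @ filled[user] / np.maximum(norms * norms[user], 1e-12)
        raters = np.where(~np.isnan(R[:, item]))[0]        # users who rated the item
        raters = raters[raters != user]
        if raters.size == 0:
            return np.nan                                  # no information available
        top = raters[np.argsort(sims[raters])[::-1][:k]]   # k most similar raters
        return R[top, item].mean()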
Long signal change-point detection
The detection of change-points in a spatially or time ordered data sequence
is an important problem in many fields such as genetics and finance. We derive
the asymptotic distribution of a statistic recently suggested for detecting
change-points. Simulation of its estimated limit distribution leads to a new
and computationally efficient change-point detection algorithm, which can be
used on very long signals. We assess the algorithm via simulations and on
previously benchmarked real-world data sets.
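The abstract does not spell the statistic out, so the following is only a generic single change-point scan of CUSUM type, with its detection threshold calibrated by simulating the statistic under a no-change Gaussian model; every concrete choice here is an illustrative assumption:

    import numpy as np

    def cusum_scan(x):
        """Maximum standardized gap between the means of x[:t] and x[t:];
        a large value suggests a change-point near the maximizing t."""
        n = len(x)
        t = np.arange(1, n)
        csum = np.cumsum(x)[:-1]
        left, right = csum / t, (x.sum() - csum) / (n - t)
        stat = np.sqrt(t * (n - t) / n) * np.abs(left - right)
        return stat.max(), int(stat.argmax()) + 1

    def simulated_threshold(n, level=0.05, n_sim=2000, seed=0):
        """Approximate (1 - level) quantile of the scan under i.i.d. N(0, 1) noise."""
        rng = np.random.default_rng(seed)
        sims = [cusum_scan(rng.standard_normal(n))[0] for _ in range(n_sim)]
        return np.quantile(sims, 1 - level)

In this sketch the whole scan costs O(n) once the cumulative sums are formed, which is what makes very long signals tractable.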
Consistency of random forests
Random forests are a learning algorithm proposed by Breiman [Mach. Learn. 45
(2001) 5--32] that combines several randomized decision trees and aggregates
their predictions by averaging. Despite its wide usage and outstanding
practical performance, little is known about the mathematical properties of the
procedure. This disparity between theory and practice originates in the
difficulty of simultaneously analyzing both the randomization process and the
highly data-dependent tree structure. In the present paper, we take a step
forward in forest exploration by proving a consistency result for Breiman's
[Mach. Learn. 45 (2001) 5--32] original algorithm in the context of additive
regression models. Our analysis also sheds an interesting light on how random
forests can nicely adapt to sparsity.

1. Introduction. Random forests are an
ensemble learning method for classification and regression that constructs a
number of randomized decision trees during the training phase and predicts by
averaging the results. Since its publication in the seminal paper of Breiman
(2001), the procedure has become a major data analysis tool that performs well
in practice in comparison with many standard methods. What has greatly
contributed to the popularity of forests is the fact that they can be applied
to a wide range of prediction problems and have few parameters to tune. Aside
from being simple to use, the method is generally recognized for its accuracy
and its ability to deal with small sample sizes, high-dimensional feature
spaces and complex data structures. The random forest methodology has been
successfully applied to many practical problems, including air quality
prediction (winning code of the EMC data science global hackathon in 2012, see
http://www.kaggle.com/c/dsg-hackathon), chemoinformatics [Svetnik et al.
(2003)], ecology [Prasad, Iverson and Liaw (2006), Cutler et al. (2007)], 3
- …
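As a usage-level illustration of the original algorithm studied above, a random forest can be fit on a simulated additive regression model in a few lines; scikit-learn's implementation and the parameter values are assumptions for illustration, not the paper's theoretical setting:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(1000, 10))                  # 10 features, only 2 relevant
    y = 2 * X[:, 0] + np.sin(4 * X[:, 1]) + 0.1 * rng.standard_normal(1000)

    forest = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)
    forest.fit(X, y)                                  # each tree sees a bootstrap sample
    print(forest.predict(X[:5]))                      # predictions are averages over trees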